255 research outputs found
On the improvement of speaker diarization by detecting overlapped speech
Simultaneous speech in meeting environments accounts for a certain amount of the errors made by standard speaker diarization systems. We present an overlap detection system for far-field data based on spectral and spatial features, where the spatial features obtained from different microphone pairs are fused by means of principal component analysis. Detected overlap segments are applied to speaker diarization in order to increase the purity of speaker clusters and to recover missed speech by assigning multiple speaker labels. An investigation of the relationship between overlap detection properties and diarization improvement revealed very distinct behaviour of overlap exclusion and overlap labeling.
Postprint (published version)
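As a rough illustration of the PCA fusion step described above (not the authors' implementation; the toy data and component count are made up), spatial feature streams from several microphone pairs can be projected onto their leading principal component:

```python
import numpy as np

def pca_fuse(features, n_components=1):
    """Fuse per-microphone-pair feature streams via PCA (sketch).

    features: (n_frames, n_pairs) array, one spatial feature per pair.
    Returns the (n_frames, n_components) projection onto the leading
    principal components.
    """
    X = features - features.mean(axis=0)      # centre each stream
    cov = np.cov(X, rowvar=False)             # pair-by-pair covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues ascending
    top = eigvecs[:, ::-1][:, :n_components]  # leading components
    return X @ top

# Toy example: three correlated microphone-pair streams.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
streams = np.hstack([base + 0.1 * rng.normal(size=(200, 1))
                     for _ in range(3)])
fused = pca_fuse(streams)
print(fused.shape)  # (200, 1)
```

Since the three streams share the same underlying signal, the first principal component recovers it almost exactly.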
Speaker recognition using frequency filtered spectral energies
The spectral parameters that result from filtering the frequency sequence of log mel-scaled filter-bank energies with a simple first- or second-order FIR filter have proved to be an efficient speech representation in terms of both speech recognition rate and computational load. Recently, the authors have shown that this frequency filtering can approximately equalize the cepstrum variance, enhancing the oscillations of the spectral envelope curve that are most effective for discrimination between speakers. Speaker identification results even better than those obtained with the mel-cepstrum have been achieved on the TIMIT database, especially when white noise was added. On the other hand, the hybridization of linear prediction and filter-bank spectral analysis, using either the cepstral transformation or the alternative frequency filtering, has been explored for speaker verification. The combination of hybrid spectral analysis and frequency filtering, which had been shown to outperform the conventional techniques in clean and noisy word recognition, has yielded good text-dependent speaker verification results on the new speaker-oriented telephone-line POLYCOST database.
Peer Reviewed. Postprint (published version)
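The frequency filtering referred to above can be sketched as follows, assuming a second-order FIR filter of the form H(z) = z - z^{-1} applied along the frequency axis; the exact filter and edge handling used by the authors may differ:

```python
import numpy as np

def frequency_filter(log_fbe):
    """Filter the frequency sequence of log filter-bank energies with
    the FIR filter H(z) = z - z^{-1} (illustrative sketch).

    log_fbe: (n_frames, n_bands) log mel filter-bank energies.
    Returns an array of the same shape; band edges are handled here by
    replicating the edge values before differencing.
    """
    padded = np.pad(log_fbe, ((0, 0), (1, 1)), mode="edge")
    # y[k] = x[k+1] - x[k-1]: difference of the neighbouring bands
    return padded[:, 2:] - padded[:, :-2]
```

On a spectral envelope that is locally linear in the band index, the filter output is constant, so slowly varying envelope trends are attenuated relative to the envelope oscillations the abstract mentions.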
Maximum likelihood weighting of dynamic speech features for CDHMM speech recognition
Speech dynamic features are routinely used in current speech recognition systems in combination with short-term (static) spectral features. Although many existing speech recognition systems do not weight the two kinds of features, it is convenient to apply some weighting in order to increase the recognition accuracy of the system. When such weighting is performed, it is either manually tuned or simply consists of compensating the variances. The aim of this paper is to propose a method to automatically estimate an optimum state-dependent stream weighting in a continuous density hidden Markov model (CDHMM) recognition system by means of a maximum-likelihood based training algorithm. Unlike other works, it is shown that simple constraints on the new weighting parameters make it possible to apply the maximum-likelihood criterion to this problem. Experimental results in speaker-independent digit recognition show an important increase in recognition accuracy.
Peer Reviewed. Postprint (published version)
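As a minimal sketch of state-dependent stream weighting (not the paper's estimation algorithm), per-stream state log-likelihoods can be combined with weights constrained to be non-negative and to sum to the number of streams, so that uniform weights recover plain unweighted scoring:

```python
import numpy as np

def weighted_state_loglik(log_liks, weights):
    """Combine per-stream log-likelihoods for one HMM state with
    state-dependent stream weights (hypothetical sketch).

    log_liks: (n_streams,) log-likelihoods, e.g. static and dynamic.
    weights:  (n_streams,) non-negative weights summing to the number
              of streams; all-ones reproduces unweighted scoring.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0.0)
    assert np.isclose(weights.sum(), len(weights))
    return float(np.dot(weights, log_liks))

# Uniform weights leave the combined score unchanged.
print(weighted_state_loglik([-10.0, -4.0], [1.0, 1.0]))  # -14.0
```

A sum-to-constant constraint of this kind is one simple way to keep the weighted likelihood comparable across states, in the spirit of the constraints the abstract alludes to.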
Self-supervised deep learning approaches to speaker recognition: A Ph.D. Thesis overview
Recent advances in Deep Learning (DL) for speaker recognition have improved performance but are constrained by the need for labels for the background data, which is difficult to meet in practice. In i-vector based speaker recognition, cosine (unsupervised) and PLDA (supervised) are the basic scoring techniques, with a big performance gap between the two. In this thesis we tried to fill this gap without using speaker labels, in several ways. We applied Restricted Boltzmann Machine (RBM) vectors to the tasks of speaker clustering and tracking in TV broadcast shows. The experiments on the AGORA database show that with this approach we gain relative improvements of 12% and 11% for the speaker clustering and tracking tasks, respectively. We also applied DL techniques in order to increase the discriminative power of i-vectors in the speaker verification task, for which we have proposed the use of an autoencoder in several ways, i.e., (1) as pre-training for a Deep Neural Network (DNN), (2) as a nearest-neighbor autoencoder for i-vectors, and (3) as an average-pooled nearest-neighbor autoencoder. The experiments on the VoxCeleb database show relative improvements of 21%, 42% and 53% with the three systems, respectively. Finally, we also proposed a self-supervised end-to-end speaker verification system. The architecture is based on a Convolutional Neural Network (CNN), trained as a siamese network with multiple branches. From the results we can see that our system shows performance comparable to a supervised baseline.
This work was supported by the project PID2019-107579RBI00 / AEI / 10.13039/501100011033. Peer Reviewed. Postprint (published version)
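For reference, the unsupervised cosine scoring mentioned above reduces to a normalized inner product between the enrolment and test i-vectors; this is a generic sketch, not the thesis code:

```python
import numpy as np

def cosine_score(w_enrol, w_test):
    """Cosine similarity between two i-vectors (generic sketch).

    Requires no speaker labels, which is why it serves as the
    unsupervised baseline against PLDA scoring.
    """
    return float(np.dot(w_enrol, w_test) /
                 (np.linalg.norm(w_enrol) * np.linalg.norm(w_test)))

print(cosine_score(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
```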
Jitter and Shimmer measurements for speaker diarization
Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and to detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of a speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features for the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of the Augmented Multiparty Interaction (AMI) corpus, a set of multi-party, spontaneous-speech recordings. Both sets of features are modeled independently using mixtures of Gaussians and fused at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements into the baseline spectral features decreases the diarization error rate in most of the recordings.
Peer Reviewed. Postprint (published version)
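A minimal sketch of relative (local) jitter and shimmer, computed from pitch-period durations and cycle peak amplitudes; several standard variants exist, and the paper's exact normalization is not reproduced here:

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Relative local jitter and shimmer (illustrative definitions).

    periods:    durations of consecutive pitch periods (seconds).
    amplitudes: peak amplitudes of the same cycles.
    Jitter is the mean absolute difference between consecutive periods
    divided by the mean period; shimmer is the analogous measure on
    the amplitudes.
    """
    periods = np.asarray(periods, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer

# A perfectly periodic, constant-amplitude voice has zero jitter
# and zero shimmer.
print(jitter_shimmer([0.01] * 5, [1.0] * 5))  # (0.0, 0.0)
```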
Detection and handling of overlapping speech for speaker diarization
This thesis concerns the detection of overlapping speech segments and its application to the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis or by an approach involving a multilayer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset of candidate prosodic features is determined by a two-step mRMR feature selection algorithm. For segments containing detected overlapping speech, the speaker diarization system assigns a second speaker label, and such segments are also discarded from model training. The proposed overlap labeling technique is integrated into the Viterbi decoding part of the diarization algorithm.
Peer Reviewed. Postprint (published version)
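The second-label assignment can be caricatured as follows (a toy sketch, not the thesis' Viterbi integration): segments flagged as overlapped receive the two best-scoring speakers instead of one:

```python
def label_overlaps(segment_scores, overlap_segments):
    """Toy second-label assignment for overlapped segments.

    segment_scores:   list of dicts {speaker: log-likelihood},
                      one dict per segment.
    overlap_segments: set of segment indices flagged as overlapped.
    Returns a list of speaker-label lists: the best speaker for normal
    segments, the best two for overlapped ones.
    """
    labels = []
    for i, scores in enumerate(segment_scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        labels.append(ranked[:2] if i in overlap_segments else ranked[:1])
    return labels

scores = [{"A": -1.0, "B": -2.0}, {"A": -3.0, "B": -1.0}]
print(label_overlaps(scores, {1}))  # [['A'], ['B', 'A']]
```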
The UPC speaker verification system submitted to VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20)
This report describes the submission from the Technical University of Catalonia (UPC) to the VoxCeleb Speaker Recognition Challenge (VoxSRC-20) at Interspeech 2020. The final submission is a combination of three systems. System-1 is an autoencoder-based approach that tries to reconstruct similar i-vectors, whereas System-2 and System-3 are Convolutional Neural Network (CNN) based siamese architectures. The siamese networks have two and three branches, respectively, where each branch is a CNN encoder. The double-branch siamese performs binary classification using a cross-entropy loss during training, whereas the triple-branch siamese is trained to learn speaker embeddings using a triplet loss. We provide the results of our systems on the VoxCeleb-1 test, VoxSRC-20 validation and test sets.
This work was supported by the project PID2019-107579RBI00 / AEI / 10.13039/501100011033. Preprint
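The triplet loss used by the triple-branch siamese has the standard form below; the margin, distance measure and embedding dimensionality are illustrative assumptions, not values from the report:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embeddings (generic sketch).

    Pulls the anchor towards the positive (same speaker) and pushes it
    away from the negative (different speaker) until the squared
    distance gap exceeds `margin`; the margin value is an assumption.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))
```

During training, each of the three branches encodes one element of the (anchor, positive, negative) triplet, and this loss is minimized over the shared encoder weights.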
GCC-PHAT based head orientation estimation
This work presents a novel two-step algorithm to estimate the orientation of speakers in a smart-room environment equipped with microphone arrays. First, the position of the speaker is estimated by the SRP-PHAT algorithm, and the time delay of arrival for each microphone pair with respect to the detected position is computed. In the second step, the value of the cross-correlation at the estimated time delay is used as the fundamental characteristic from which to derive the speaker orientation. The proposed method performs consistently better than other state-of-the-art acoustic techniques on a purposely recorded database and on the CLEAR head pose database.
Peer Reviewed. Postprint (author's final draft)
From features to speaker vectors by means of restricted Boltzmann machine adaptation
Restricted Boltzmann Machines (RBMs) have shown success in different stages of speaker recognition systems. In this paper, we propose a novel framework to produce a vector-based representation for each speaker, which will be referred to as the RBM-vector. This new approach maps the speaker spectral features to a single fixed-dimensional vector carrying speaker-specific information. In this work, a global model, referred to as the Universal RBM (URBM), is trained taking advantage of the unsupervised learning capabilities of RBMs. This URBM is then adapted to the data of each speaker in the development, enrolment and evaluation datasets. The network connection weights of the adapted RBMs are concatenated and subjected to whitening with dimensionality reduction to build the speaker vectors. The evaluation is performed on the core test condition of the NIST SRE 2006 database, and it is shown that RBM-vectors achieve a 15% relative improvement in terms of EER compared to i-vectors using cosine scoring. Score fusion with the i-vector attains more than a 24% relative improvement. The interest of this result lies in the fact that both vectors are produced in an unsupervised fashion, so they can be used instead of the i-vector/PLDA approach when no data labels are available. Results obtained with the RBM-vector/PLDA framework are comparable to those of i-vector/PLDA, and their score fusion achieves a 14% relative improvement over i-vector/PLDA.
Peer Reviewed. Postprint (published version)
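The whitening-with-dimension-reduction stage can be sketched as PCA whitening of the stacked, concatenated weight vectors; the dimensionalities and the particular whitening method the paper uses are not specified here:

```python
import numpy as np

def whiten_reduce(vectors, dim):
    """PCA whitening with dimensionality reduction (sketch of the
    post-processing applied to the concatenated adapted weights).

    vectors: (n_speakers, n_features) stacked concatenated weights.
    Returns (n_speakers, dim) speaker vectors with decorrelated,
    unit-variance components.
    """
    X = vectors - vectors.mean(axis=0)             # centre
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Project onto the top `dim` directions, scale to unit variance.
    scale = s[:dim] / np.sqrt(len(X) - 1) + 1e-12
    return (X @ Vt[:dim].T) / scale

rng = np.random.default_rng(2)
vecs = rng.normal(size=(50, 400))                  # toy weight vectors
out = whiten_reduce(vecs, 10)
print(out.shape)  # (50, 10)
```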
- …